Abstract

The problem of identifying splice sites consists of two sub-problems: finding their boundaries, 
and characterizing their sequence markers. Other splicing elements, including enhancers and
silencers, that occur in the intronic and exonic regions play an important role in splicing activity.
Existing methods for detecting splicing elements are limited to finding either splice sites or 
enhancers and silencers, even though these elements are well-known to co-occur. We introduce 
SeeSite, an efficient and accurate tool for detecting splice sites and their complementary exon 
splicing enhancers (ESEs). 

SeeSite has three stages: graph construction, finding dense subgraphs, and recovering splice 
sites and ESEs along with their consensus. The third step involves solving Consensus Sequence 
with Outliers, an NP-complete string clustering problem. We prove that our algorithm for this 
problem outputs near-optimal solutions in polynomial time. Using SeeSite we demonstrate that 
ESEs are preferentially associated with weaker splice sites, and splice sites of a certain canonical 
form co-occur with specific ESEs. 



1 Introduction 

Genes in eukaryotes typically consist of the protein-coding DNA sequences, called exons, which may 
be interrupted by stretches of non-coding DNA, called introns, which are spliced out of mRNA. 
This RNA splicing is a fundamental biological process that is dictated by sequence markers located 
at splice sites, or exon-intron boundaries. In addition to these splice sites, proximal sequences affect 
the splicing efficiency by recruiting helper proteins that have the effect of enhancing or silencing 
the splicing process [6]. The classic experimental approach to discovering exons and their splice sites
is to map Expressed Sequence Tags (ESTs) to the reference genome [1]. RNA-seq is the
modern approach to gene annotation: it offers greatly increased throughput and serves several
purposes beyond this single task p3]. However, the reduced read length and increased noise of
this method yield false evidence for many non-existent splicing events, creating an opportunity for
computational methods to identify novel splice sites [22} ITT].

The problem of identifying splice sites consists of two sub-problems: finding their boundaries, 
and characterizing their sequence markers. A marker is a substring of an RNA molecule that is
recognized by a protein. Markers recognized by the same protein have similar nucleotide sequences. 
We define a motif to be a string that represents the common nucleotide pattern recognized by the
protein; thus, a motif is a representative string for a set of markers. Splice sites can be extremely
degenerate and therefore deviate quite dramatically from their motif. For example, mammalian 5'
and 3' splice sites have the motifs (A/C)AG||GT and AG||GT(A/G)AGT, respectively, where '||'
marks the exon-intron or intron-exon boundary. Fewer than 5% of known splice sites match these
canonical motifs perfectly [6]. In fact, more than 60% of the remaining splice sites have at least
3 mismatches from the consensus sequence. In such cases, where the splice site marker is weak,
additional sequence markers nearby serve as binding sites for enhancer or silencer proteins and
hence are control mechanisms for splicing [3].

There exist a number of methods that address the problem of identifying splice sites. These
methods look for motif sequences exclusively at the exon-intron or intron-exon boundary. For
example, TopHat [22] maps as many RNA-seq reads as possible to the genome, forming exon islands,
and then analyzes the mapping results to identify the splice sites. MapSplice [23], HMMSplicer [8], and
SpliceMap [2] improve upon this method by using a refined alignment algorithm for the RNA-seq
reads. All these methods search for the splice sites at a specific position, which is given by the
alignment of the reads.

The methods discussed above focus only on finding splice sites and largely ignore the
detection of other splicing elements, such as enhancers and silencers that occur in intron (i.e.,
ISE and ISS) and exon (i.e., ESE and ESS) regions. Separate computational programs exist to
detect these elements. These methods fix the motif sequence and search for it in the vicinity of
a boundary. More specifically, they search for $\ell$-mers that are statistically enriched in a set of
cases versus controls (5j El HUJ H7J [27]. For example, RESCUE-ESE [9] is a well-known program 
that compares exons with weak splice sites to those with strong splice sites reasoning that those 
with weak splice sites will have more enhancers. Hence, existing computational approaches for 
identifying splicing motifs fall into two categories based on their aim: those that detect splice sites, 
and those that find splicing enhancers or silencers. 

While splice sites and their complementary splicing elements have been studied and detected
separately, co-occurrence relationships between them are known to exist. For
example, it has been demonstrated that evolutionary changes that weaken a splice site can be
compensated by changes in the exonic splicing enhancer (ESE) or silencer (ESS) [13, 24]. Further,
this relationship is illustrated by the fact that many alternative splice sites are weaker than
constitutive sites [19, 28]. Beyond this general relationship, in which a weak splice site
is complemented by an enhancer or silencer, there is evidence for more specific relationships
between these two factors. Xiao et al. [24] demonstrated such a relationship by showing that intronic
splicing enhancers (ISEs) show specificity for different classes of splice site motifs that contribute
to exon definition.

To the best of our knowledge, there do not exist any well-established computational methods
that detect both splice site motifs and other splicing elements, while simultaneously characterizing 
the relationship between them. We introduce SeeSite, a computational program that aims at 
detecting splice sites and discovering splicing enhancers in exonic regions. SeeSite involves three 
stages: graph construction, finding dense subgraphs, and recovering splice sites and ESEs along with 
their consensus sequences. The first two stages can be handled using standard methods. However, 
the third stage requires solving a string clustering problem in the presence of noise since a number 
of the splice sites can be highly degenerate. We formalize this clustering problem as the Consensus 
Sequence with Outliers problem. This combinatorial problem is NP-complete, so we have to settle
for heuristic methods to solve it. On the other hand, we show that choosing the parameters
of our algorithm carefully yields strong guarantees on the quality of the output. Specifically, we
show that our algorithm is an efficient polynomial time approximation scheme (EPTAS) unless the
noise completely overwhelms the signal. This allows us to cluster the motifs into different types
(corresponding to different enhancer/silencer binding sites), not all of which need to be present in
all input sequences in order to identify their consensus. We extend our theoretical findings by also 
giving a (less practical) PTAS for Consensus Sequence with Outliers without any restrictions on 
the input. 

This work describes an algorithmic framework to study the problem of splice site and splicing 
enhancer discovery in exonic regions. The resulting method, SeeSite, is robust to uncertainty in 
both the sequence and position of the splice sites and ESEs. Thus, it is suitable for finding both
strong and weak canonical splice motifs and their variable proximal markers in the context of similar
splice sites. SeeSite is one of the first computational tools that aims to detect not only splice
sites but also the accompanying ESEs. Hence, we believe it will be a valuable tool in elucidating splicing
mechanisms in a variety of species. Our main contributions are summarized below: 

• We develop a theoretical framework for detecting splicing elements, and characterizing the 
relationship between them and splice sites. 

• We describe a polynomial-time algorithm for detecting both ESEs and splice sites. 

• We provide empirical evidence that certain splice sites co-occur with specific ESEs.

2 Consensus Sequence with Outliers 
2.1 Problem Formulation 

We are interested in detecting splice sites that have similar sequence markers, and among those
that have degenerate splice sites we aim to detect ESEs. Given a set of possible $\ell$-mers (i.e., possible
markers), we would like to determine the consensus sequence (i.e., the motif) and the subset of $\ell$-mers
that are the most degenerate. The following problem formally defines this task.

Definition 1. (Consensus Sequence with Outliers) We denote by $d(x, y)$ the Hamming
distance between the length-$\ell$ sequences $x$ and $y$. Given $n$ length-$\ell$ sequences
$S = \{s_1, \ldots, s_n\}$ over a finite alphabet $\Sigma$ and a nonnegative integer $k$, the aim of the
Consensus Sequence with Outliers problem is to find a consensus sequence $s$ and a subset
$S^* \subseteq S$, where $n - |S^*| = k$ and $\sum_{s_i \in S^*} d(s_i, s)$ is minimal.

The problem is NP-hard [4]; however, it is amenable to efficient approximation algorithms
that work well in practice. Before defining these algorithms, we give some
preliminary definitions. For a set $S$ of length-$\ell$ sequences, we denote the consensus sequence of
$S$ as $c(S)$ and define it to be the sequence obtained by picking a most-frequent
character at every position, with ties broken arbitrarily. We note that the tie-breaking will not
affect our arguments. We denote the sum Hamming distance between a set of sequences $S$ and a
single sequence $s$ as $d(S, s) = \sum_{s_i \in S} d(s_i, s)$. The Consensus Sequence with Outliers
problem can now be succinctly stated as follows: given a set $S$ of sequences and an integer $k$, the
objective is to find a subset $S^* \subseteq S$ of size $n^* = n - k$ such that $d(S^*, c(S^*))$ is minimized.

Given a subset $S^* \subseteq S$ we can compute $c(S^*)$ in polynomial time. If we are given $c(S^*)$ for
the optimal solution $S^*$ (but not $S^*$ itself) then we can recover $S^*$ from $c(S^*)$ and $S$ in
polynomial time, since $S^*$ is the set of the $n - k$ sequences in $S$ that are closest to $c(S^*)$. Similarly,
given any sequence $x$, we denote by $S_x$ the subset of $S$ containing the $n^*$ sequences closest to $x$. By
construction $S_x$ satisfies the following inequality: $d(S', x) \ge d(S_x, x) \ge d(S_x, c(S_x))$ for any
subset $S' \subseteq S$ of size $n^*$.
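
The preliminary definitions above translate almost directly into code. The following is a minimal Python sketch, not the authors' implementation; the function names `hamming`, `consensus`, `total_distance`, and `closest_subset` are hypothetical, and ties are broken by first occurrence, which is one admissible choice of the arbitrary tie-breaking.

```python
from collections import Counter


def hamming(x, y):
    """Hamming distance d(x, y) between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))


def consensus(seqs):
    """Consensus c(S): a most-frequent character in every column.

    Counter.most_common orders equal counts by first occurrence, which
    gives a deterministic (but otherwise arbitrary) tie-breaking rule.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def total_distance(seqs, x):
    """Sum Hamming distance d(S, x) between a set S and one sequence x."""
    return sum(hamming(s, x) for s in seqs)


def closest_subset(seqs, x, k):
    """S_x: the n - k sequences of S closest to x (the k farthest are dropped)."""
    # sorted() is stable, so sequences at equal distance keep their input order.
    return sorted(seqs, key=lambda s: hamming(s, x))[: len(seqs) - k]


if __name__ == "__main__":
    S = ["ACGT", "ACGA", "ACGT", "TTTT"]
    x = consensus(S)
    S_x = closest_subset(S, x, k=1)
    print(x, S_x, total_distance(S_x, consensus(S_x)))
```

Here `closest_subset` realizes the observation that, once a candidate consensus is fixed, the best subset of size n - k is found by sorting on distance.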

2.2 Efficient Algorithms for Consensus Sequences with Outliers 

We give a heuristic algorithm for solving Consensus Sequence with Outliers based on random
sampling. The algorithm has two parameters, $r$ and $t$. It picks $r$ sequences
$S' = (s'_1, s'_2, \ldots, s'_r)$ from $S$ uniformly at random (with replacement) and finds the
consensus sequence corresponding to $S'$. It repeats this process $t$ times and outputs the best
consensus sequence found. The pseudocode for this algorithm is given in Algorithm 1.

Algorithm 1

Input: $S$, $k$, $r$, $t$.

Output: a sequence $s$ and a subset $S^* \subseteq S$ of size $n - k$.

1: Try $t$ times:

(a) Choose a random subset of $S$ of size $r$, denoted by $S'$.

(b) Let $S_{max}$ be a set of $k$ sequences that have the largest Hamming distance from $c(S')$.

(c) Let $S^*$ be equal to $S$ with all the sequences in $S_{max}$ removed.

(d) Keep track of the $c(S^*)$ with minimum $d(S^*, c(S^*))$.

2: Return the $c(S^*)$ with minimum $d(S^*, c(S^*))$ and the corresponding $S^*$.
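
The sampling procedure of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the SeeSite source: the function name `consensus_with_outliers` and the optional `rng` argument are hypothetical, sampling is done with replacement as in the prose description, and ties in the consensus are broken by first occurrence.

```python
import random
from collections import Counter


def hamming(x, y):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))


def consensus(seqs):
    """Most-frequent character per column (ties: first occurrence)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def consensus_with_outliers(seqs, k, r, t, rng=None):
    """Randomized heuristic: t rounds of sampling r sequences.

    Returns (s, kept) where kept has n - k sequences and s = c(kept).
    """
    if t < 1 or r < 1 or not 0 <= k < len(seqs):
        raise ValueError("need t >= 1, r >= 1 and 0 <= k < n")
    rng = rng or random.Random()
    n = len(seqs)
    best = None  # (cost, consensus, kept subset)
    for _ in range(t):
        # (a) sample r sequences uniformly at random, with replacement
        sample = [rng.choice(seqs) for _ in range(r)]
        x = consensus(sample)
        # (b)+(c) drop the k sequences farthest from c(S')
        kept = sorted(seqs, key=lambda s: hamming(s, x))[: n - k]
        # (d) score the surviving set against its own consensus
        c = consensus(kept)
        cost = sum(hamming(s, c) for s in kept)
        if best is None or cost < best[0]:
            best = (cost, c, kept)
    return best[1], best[2]


if __name__ == "__main__":
    data = ["ACGTAC"] * 8 + ["TTTTTT", "GGGGGG"]
    print(consensus_with_outliers(data, k=2, r=3, t=50, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes runs reproducible, which is convenient when comparing parameter choices for r and t.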



2.2.1 Approximation guarantees. 

In this section we prove guarantees on the quality of the solution output by Algorithm 1 when
the parameters $r$ and $t$ are chosen appropriately. In particular, we show that Algorithm 1 is an
efficient polynomial time approximation scheme (EPTAS) for Consensus Sequence with Outliers if
the data does not consist mainly of outliers. A polynomial time approximation scheme (PTAS)
is an algorithm that for every $\epsilon > 0$ runs in polynomial time and outputs a
$(1+\epsilon)$-approximate solution. Typically the running time upper bound, while polynomial for
every fixed value of $\epsilon$, grows very rapidly as $\epsilon$ tends to 0. If the exponent of the
polynomial in the running time of the algorithm is independent of $\epsilon$ then the PTAS is said to
be an efficient PTAS (EPTAS). To prove our bounds we prove the following technical lemma, which
states that if the sample $S'$ was taken from an (unknown) optimal solution $S^*$, rather than from
the entire input set $S$, then in expectation $c(S')$ is almost as good as the consensus sequence for
the set $S^*$.

Lemma 1. For all $\epsilon > 0$ and $\sigma$, there exists a value of $r$ such that the following holds:
if $S$ is a set of $n$ length-$\ell$ sequences over the alphabet $\Sigma$, where the size of $\Sigma$ is
equal to $\sigma$, and $S'$ is a subset of $S$ of size $r$, $(s'_1, s'_2, \ldots, s'_r)$, chosen uniformly
at random, then $E[d(S, c(S'))] \le (1+\epsilon)\, d(S, c(S))$.

Proof. We prove that there exists an $r$ such that $E[d(S, c(S'))] \le (1+2\epsilon)\, d(S, c(S))$.
Applying this weaker inequality with $\epsilon' = \epsilon/2$ then proves the statement of the Lemma.
We assume, without loss of generality, that $c(S)$ is equal to $0^\ell$, $\epsilon \le 1/16$, and
$r \ge 8$. We restrict interest to column $i$ of $S$, where $0 < i \le \ell$; let $d_i$ be the number of
nonzero symbols in column $i$ and let $z_i = n - d_i$. Observe that $d(S, c(S'))$ is equal to the sum
over $i$ of the number of sequences $s \in S$ such that $s[i] \ne c(S')[i]$. By linearity of expectation
it is sufficient to prove that for every $i$ we have $E[d(S[i], c(S')[i])] \le (1+2\epsilon)\, d_i$.

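First, we assume $d_i$ is at most $\epsilon n$. Let $q$ be the probability that $c(S')[i] \ne 0$. It
follows that $E[d(S[i], c(S')[i])]$ is at most $d_i(1-q) + qn$. We determine an upper bound on the
probability $q$ as follows:

$$q \le \sum_{x=\lceil r/2 \rceil}^{r} \binom{r}{x} (d_i/n)^x (1 - d_i/n)^{r-x} \le 2^r \sum_{x=\lceil r/2 \rceil}^{r} (d_i/n)^x \le 2^r \frac{(d_i/n)^{\lceil r/2 \rceil}}{1 - (d_i/n)}.$$

Since $d_i/n \le \epsilon \le 1/16$, we get:

$$q \le 2^{r+1} (d_i/n)^{\lceil r/2 \rceil} \le 2^{r+1} \epsilon^{r/4} (d_i/n)^{\lfloor r/4 \rfloor} \le 2^r \left(\frac{1}{16}\right)^{r/4} \cdot 2 (d_i/n)^{\lfloor r/4 \rfloor} = 2 (d_i/n)^{\lfloor r/4 \rfloor}.$$

It follows from the last inequality, and from $r \ge 8$, that $q \le 2(d_i/n)^2$. Hence, we obtain the
following bound on $E[d(S[i], c(S')[i])]$:

$$E[d(S[i], c(S')[i])] \le d_i(1-q) + qn \le d_i + 2\left(\frac{d_i}{n}\right)^2 n \le (1+2\epsilon)\, d_i.$$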

Next, we assume that $d_i > \epsilon n$. We say that a symbol $a \in \Sigma$ is a good symbol if there
are at least $z_i - n\epsilon^2$ sequences in $S$ that have the symbol $a$ at column $i$; any symbol
that is not good is bad. If $c(S')[i]$ is a good symbol then $d(S[i], c(S')[i])$ is at most
$d_i + n\epsilon^2$ and hence is at most $(1+\epsilon)\, d_i$, since $d_i > \epsilon n$. Let $p$ be the
probability that $c(S')[i]$ is a bad symbol; then $E[d(S[i], c(S')[i])]$ is upper bounded by
$(1-p)(1+\epsilon)\, d_i + pn$. Lastly, we determine an upper bound on $p$ to complete the proof.

Let $a$ be a bad symbol and $p_a$ be the probability that $c(S')[i]$ is equal to $a$. We note that in
order for $c(S')[i]$ to be $a$, there have to be at least as many positions equal to $a$ as positions
equal to 0 in $S'[i]$. Let $X$ be the difference between the number of positions equal to $a$ and the
number of positions equal to 0 in $S'[i]$. It follows that $p_a \le \Pr[X \ge 0]$. Let $X_j$ be an
indicator variable which is 1 if $s'_j[i]$ is equal to $a$, $-1$ if it is equal to 0, and 0 otherwise,
so that $X = \sum_{j=1}^{r} X_j$. Since $a$ is a bad symbol, there are at least $n\epsilon^2$ more
positions equal to 0 than positions equal to $a$ in $S[i]$, and therefore
$E[X_j] = \Pr[s'_j[i] = a] - \Pr[s'_j[i] = 0] \le -\epsilon^2$. By linearity of expectation, we obtain
$E[X] = \sum_{j=1}^{r} E[X_j] \le -r\epsilon^2$. Using this inequality, we get
$\Pr[X \ge 0] \le \Pr[X - E[X] \ge r\epsilon^2]$. Since the $X_j$ variables are independent and the
difference between the upper and lower bound of $X_j$ is 2, we can use Hoeffding's inequality to
obtain the following bound.

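$$\Pr[X - E[X] \ge r\epsilon^2] \le \exp\left(\frac{-2r^2\epsilon^4}{4r}\right) = \exp\left(\frac{-r\epsilon^4}{2}\right).$$

By choosing $r = \min\left(n, \max\left(\frac{2}{\epsilon^4}\ln\frac{\sigma}{\epsilon^2}, 8\right)\right)$, we get
$p_a \le \epsilon^2/\sigma$. Finally, we bound $p$ as follows:
$p \le \sum_{a} p_a \le \sigma \cdot \frac{\epsilon^2}{\sigma} = \epsilon^2$. We can now use the upper
bound on $p$ and our assumption that $d_i > \epsilon n$ to bound $E[d(S[i], c(S')[i])]$:

$$E[d(S[i], c(S')[i])] \le (1-p)(1+\epsilon)\, d_i + pn \le (1+\epsilon)\, d_i + \epsilon^2 n \le (1+2\epsilon)\, d_i.$$

□
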
If the number of outliers is small, then with reasonably high probability a small random subset
of the input sequences will not contain any outliers. If the random sample does not contain outliers
we can use Lemma 1 to tie the quality of the output solution to the quality of the optimum
solution. Based on this intuition we can prove the following theorem.

Theorem 1. There exists a randomized EPTAS for Consensus Sequence with Outliers for inputs
where $k \le cn$ for $c < 1$. The algorithm runs in time
$\left(\frac{1}{1-c}\right)^{r} \cdot f(\epsilon) \cdot (n\ell)^{O(1)}$ and outputs a
$(1+\epsilon)$-approximate solution with probability $1/2$.

Proof. The algorithm selects a value for $r$ such that for a random subset $S'$ of the unknown optimal
solution $S^*$ the inequality $E[d(S^*, c(S'))] \le (1+\frac{\epsilon}{3})\, d(S^*, c(S^*))$ holds. It
follows from Lemma 1 that this can be done so that $r$ depends only on $\epsilon$. We show that a single
iteration of the outer loop of Algorithm 1 with this choice of $r$ yields a $(1+\epsilon)$-approximate
solution with probability $(1-c)^r \cdot f(\epsilon)$. Then setting
$t = O\left(\frac{1}{(1-c)^r \cdot f(\epsilon)}\right)$ yields the statement of the theorem.

It remains to find a sufficient lower bound on the probability that the set returned by a single
iteration of the outer loop of Algorithm 1 is a $(1+\epsilon)$-approximation. Since $k \le cn$, it
follows that the probability that $S'$ is taken from an (unknown) optimal solution $S^*$ is at least
$\left(\frac{n - cn}{n}\right)^r = (1-c)^r$. If $S'$ is taken from $S^*$ then by Lemma 1 we have that
$E[d(S^*, c(S'))] \le (1+\frac{\epsilon}{3})\, d(S^*, c(S^*))$. By Markov's inequality [12, p. 311] the
probability that $d(S^*, c(S'))$ exceeds its expectation by a factor of at least $1 + \frac{\epsilon}{3}$
is at most $\frac{1}{1+\epsilon/3}$. Hence, with probability $f(\epsilon)$ for some function $f$ of
$\epsilon$ we have that:

$$d(S^*, c(S')) \le \left(1 + \frac{\epsilon}{3}\right) d(S^*, c(S^*)) \cdot \left(1 + \frac{\epsilon}{3}\right),$$

which is at most $(1+\epsilon)\, d(S^*, c(S^*))$ when $\left(\frac{\epsilon}{3}\right)^2 \le \frac{\epsilon}{3}$.
In particular, this holds if $\epsilon \le 3$, concluding the proof. □
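
To get a feel for the bound in Theorem 1, the per-iteration success probability can be turned into a number of repetitions. The sketch below is illustrative only: it assumes the Markov step leaves a success probability of f(ε) = 1 - 1/(1 + ε/3), treats r as a given input rather than deriving it from Lemma 1, and uses the hypothetical names `success_probability` and `iterations_needed`.

```python
import math


def success_probability(c, r, eps):
    """Lower bound on the chance that one iteration is (1 + eps)-approximate.

    (1 - c)^r bounds the chance that all r samples avoid the outliers;
    f(eps) = 1 - 1 / (1 + eps / 3) is the assumed Markov success probability.
    """
    hit = (1.0 - c) ** r
    f = 1.0 - 1.0 / (1.0 + eps / 3.0)
    return hit * f


def iterations_needed(c, r, eps, target=0.5):
    """Smallest t such that at least one of t iterations succeeds
    with probability >= target, given the per-iteration bound."""
    p = success_probability(c, r, eps)
    if not 0.0 < p < 1.0:
        raise ValueError("per-iteration probability must lie strictly in (0, 1)")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    # A quarter of the input may be outliers, samples of size 8, eps = 1.
    print(success_probability(0.25, 8, 1.0), iterations_needed(0.25, 8, 1.0))
```

Because the bound comes from Markov's inequality, the resulting t is pessimistic; it is best read as an upper limit on how many repetitions are needed rather than as a recommended setting.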

We note that one would expect natural inputs to contain substantially fewer outliers than
$n/2$, and that Markov's inequality is a very pessimistic bound for the probability of achieving
the expectation. Hence, it is likely that for reasonable inputs Algorithm 1 performs much better in
practice than the proved bounds suggest. In fact, on our synthetic data the algorithm vastly
outperformed the theoretical bounds.

Using Lemma 1 we can also get a simple deterministic PTAS (but not an EPTAS) for Consensus
Sequence with Outliers without any assumptions on the relationship between $k$ and $n$. Specifically,
we prove the following theorem.

Theorem 2. There exists a PTAS for Consensus Sequence with Outliers. 

Proof. It follows from Lemma 1 that there exists an integer $r$ such that if $S'$, the set of $r$
sequences chosen from $S$, is from an (unknown) optimal solution $S^*$ then
$E[d(S^*, c(S'))] \le (1+\epsilon)\, d(S^*, c(S^*))$. Some subset $S'$ of $S^*$ must achieve the
expectation. The algorithm guesses this set $S'$ by trying all possible $n^r$ subsets of $S$ of size
$r$. For each guess, let $x = c(S')$ and let $S_x$ be the set of the $n^*$ sequences closest to $x$; the
algorithm returns the $S_x$ of minimum cost $d(S_x, c(S_x))$. For the guess that achieves the
expectation, this set satisfies
$d(S_x, c(S_x)) \le d(S_x, x) \le d(S^*, x) \le (1+\epsilon)\, d(S^*, c(S^*))$, concluding the proof. □
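
The exhaustive strategy in the proof of Theorem 2 can be sketched as follows. This is a hedged illustration rather than the authors' code: the name `exhaustive_consensus_with_outliers` is hypothetical, and the candidates are enumerated as multisets with `itertools.combinations_with_replacement`, a slight variation on trying all n^r choices that yields the same set of candidate consensus sequences up to tie-breaking.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hamming(x, y):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))


def consensus(seqs):
    """Most-frequent character per column (ties: first occurrence)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def exhaustive_consensus_with_outliers(seqs, k, r):
    """Deterministic search over every size-r multiset of input sequences.

    For each candidate S', take x = c(S'), keep the n - k sequences
    closest to x (the set S_x), and score S_x against its own consensus.
    """
    n = len(seqs)
    best = None  # (cost, consensus, kept subset)
    for sample in combinations_with_replacement(seqs, r):
        x = consensus(sample)
        kept = sorted(seqs, key=lambda s: hamming(s, x))[: n - k]
        c = consensus(kept)
        cost = sum(hamming(s, c) for s in kept)
        if best is None or cost < best[0]:
            best = (cost, c, kept)
    return best[1], best[2], best[0]


if __name__ == "__main__":
    data = ["AAAA", "AAAT", "AATA", "CCCC", "GGGG"]
    print(exhaustive_consensus_with_outliers(data, k=2, r=2))
```

The number of candidates grows polynomially in n for fixed r but with r in the exponent, which is why this variant is a PTAS and not an EPTAS.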
3 System and Methods



The basic algorithm behind the SeeSite software goes through the following phases: graph
construction, identification of dense subgraphs, and recovery of splice sites or ESEs. In this section
we discuss each of these steps in greater detail. The inputs to SeeSite are the minimum
subgraph size $m$, the maximum number of total mismatches $d$, the number of outlier sequences $k$, and
the maximum number of mismatches $b$ for the existence of an edge. In addition, there is the option
to restrict the search to ESEs or splice junctions that have a specific canonical form.

3.0.2 Graph Construction 

We construct a graph from the data, with each vertex representing an $\ell$-mer. The goal of the
construction is to ensure that dense subgraphs correspond to closely related subsequences. The
dense subgraphs will then be passed on to later stages of the algorithm for further consideration.
For the remainder of this section we assume that all input sequences have length at most $L$. We
now give a formal definition of the constructed graph.

1. The vertex set contains a vertex $v_{i,j}$ representing the length-$\ell$ subsequence in sequence $i$
starting at position $j$, for each $i$ and $j = 1, 2, \ldots, L - \ell + 1$. There are at most
$n(L - \ell + 1)$ vertices.

2. Each pair of vertices $v_{i,j}$ and $v_{i',j'}$, for $i \ne i'$, is joined by an edge when the Hamming
distance between the two represented subsequences is at most $b$.

This graph is represented by a symmetric adjacency matrix, where each entry is 0 for a non-edge, or
a positive weight for an edge. We reduce the running time of searching $G$ by considering subgraphs
of $G$, $\{G_0, G_1, \ldots, G_{L-1}\}$, where $G_i$ is the subgraph induced by a reference vertex,
denoted as $v_{R,i}$, and its neighbors (for some arbitrary choice of reference sequence $R$). We note
that similar graph constructions have been used by Yang and Rajapakse [26], Pevzner and Sze [15],
and Yang et al. [25].
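
The graph construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SeeSite implementation: it stores unweighted adjacency sets instead of a weighted adjacency matrix, uses 0-based offsets for $j$, compares all vertex pairs by brute force, and the names `build_graph` and `reference_subgraph` are hypothetical.

```python
from itertools import combinations


def hamming(x, y):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))


def build_graph(seqs, ell, b):
    """Return (kmer, adj) for the ell-mer graph.

    Vertex (i, j) stands for the ell-mer of sequence i at 0-based offset j.
    Two vertices from different sequences are adjacent when their ell-mers
    differ in at most b positions.
    """
    vertices = [(i, j) for i, s in enumerate(seqs)
                for j in range(len(s) - ell + 1)]
    kmer = {(i, j): seqs[i][j:j + ell] for (i, j) in vertices}
    adj = {v: set() for v in vertices}
    for u, v in combinations(vertices, 2):
        if u[0] != v[0] and hamming(kmer[u], kmer[v]) <= b:
            adj[u].add(v)
            adj[v].add(u)
    return kmer, adj


def reference_subgraph(adj, ref_vertex):
    """Vertex set of the subgraph induced by a reference vertex and its neighbors."""
    return {ref_vertex} | adj[ref_vertex]


if __name__ == "__main__":
    kmer, adj = build_graph(["ACGTT", "ACGAT"], ell=4, b=1)
    for v in sorted(adj):
        print(v, kmer[v], sorted(adj[v]))
```

The pairwise comparison is quadratic in the number of vertices; restricting the search to the neighborhood of each reference vertex, as the text describes, is what keeps the later dense-subgraph search manageable.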

3.0.3 Detection of Dense Subgraphs 

We implemented a modified version of the MCQ algorithm of Tomita and Seki [21] to enumerate
all dense subgraphs of size at least $m$. Each dense subgraph represents a motif instance. We chose
MCQ because experimental work shows that, compared with other existing algorithms,
it is the most efficient and practical for dealing with large graphs [20]. The underlying idea of
this branch-and-bound, depth-first search method is to begin with a small dense subgraph (i.e., a
single vertex), add a vertex to it if and only if it is connected to some minimum number of vertices
already contained in the subgraph, and halt when no other vertices can be added. If the subgraph
has size at least $m$ then it is returned. Each of the vertex sets output by the MCQ algorithm
corresponds to a set of $\ell$-mers that needs to be considered further in the next stage of the algorithm.
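
The growth rule described above (start from one vertex, add a vertex only if it has enough neighbors already inside) can be illustrated with a simple greedy sketch. This is deliberately not the MCQ algorithm and omits its branch-and-bound search; the function name `grow_dense_subgraph` and the parameter `min_links` (the required number of neighbors inside the subgraph) are hypothetical.

```python
def grow_dense_subgraph(adj, seed, min_links, m):
    """Greedily grow a dense vertex set from a seed vertex.

    adj maps each vertex to the set of its neighbors. A vertex is added
    when it is adjacent to at least min(min_links, |sub|) vertices that
    are already in the subgraph. Returns the set if it reaches size m,
    otherwise None.
    """
    sub = {seed}
    changed = True
    while changed:
        changed = False
        for v in sorted(adj):  # fixed order keeps the result deterministic
            if v in sub:
                continue
            need = min(min_links, len(sub))
            if len(adj[v] & sub) >= need:
                sub.add(v)
                changed = True
    return sub if len(sub) >= m else None


if __name__ == "__main__":
    # A 4-clique {0, 1, 2, 3} with a pendant vertex 4 attached to vertex 3.
    adj = {
        0: {1, 2, 3},
        1: {0, 2, 3},
        2: {0, 1, 3},
        3: {0, 1, 2, 4},
        4: {3},
    }
    print(grow_dense_subgraph(adj, seed=0, min_links=2, m=4))
```

A greedy pass like this finds one dense set per seed, whereas an enumeration method such as MCQ explores alternative extensions as well.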

3.0.4 Recovery of Splice Junctions and ESEs 

Each set of $\ell$-mers identified in the previous step represents an instance of the Consensus Sequence
with Outliers problem. Hence, Algorithm 1 is used to distinguish between sets of $\ell$-mers that
represent splice junctions or ESEs and those that represent spurious patterns in the input data. We
note that the size of the set of $\ell$-mers (parameter $n$ in the Consensus Sequence with Outliers
problem formulation) is at least $m$. The output is the set of all valid splice junctions, their
consensus sequence, and outliers, or the set of all valid ESEs. We will see in the next subsection how
this step can be adapted to detect both these splicing elements.

3.1 Detecting Co-occurring Splice Sites and ESEs 

We run SeeSite in a two-fold manner in order to detect the splice junctions and the ESEs associated
with weak splice sites. First, SeeSite is run to detect all possible splice junctions. In this first
run of SeeSite, Algorithm 1 is used (with appropriate values of $r$ and $t$) to determine the sets of
$\ell$-mers that correspond to sets of splice junctions; the consensus sequence and outliers for each set
of splice junctions are also output at this stage. The outliers correspond to splice junctions that
are highly degenerate. The exon regions corresponding to a set of outliers are then input into the
second run of SeeSite. In this second run, the ESEs are detected in each of these groups of exons.
Running SeeSite in this two-fold manner allows us to detect both strong and weak splice junctions
along with the ESEs associated with weak splice junctions.

As previously mentioned, the Consensus Sequence with Outliers problem formulation can be
adapted to tailor the search for splice sites or ESEs. When searching for splice sites in the first run
of SeeSite, we set $k$, the number of outliers, equal to $\lfloor n/4 \rfloor$, where $n$ is the number
of $\ell$-mers in the motif instance. When searching for ESEs in the second run of SeeSite, the number
of outliers is equal to zero; that is, Algorithm 1 will return the majority sequence.
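
The two settings of $k$ can be illustrated with a small sketch. For brevity it replaces the randomized Algorithm 1 with a deterministic stand-in that ranks sequences by distance to the consensus of the whole set; that substitution, the function name `split_outliers`, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def hamming(x, y):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))


def consensus(seqs):
    """Most-frequent character per column (ties: first occurrence)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def split_outliers(kmers, k):
    """Deterministic stand-in for Algorithm 1.

    Ranks the kmers by distance to the consensus of the full set, keeps
    the n - k closest, and reports the k farthest as outliers.
    """
    center = consensus(kmers)
    ranked = sorted(kmers, key=lambda s: hamming(s, center))
    kept = ranked[: len(kmers) - k]
    outliers = ranked[len(kmers) - k:]
    return consensus(kept), kept, outliers


if __name__ == "__main__":
    # First run: splice-site candidates, k = floor(n / 4).
    sites = ["GTAAGT"] * 6 + ["GTCCCT", "GTTTTT"]
    motif, strong, weak = split_outliers(sites, len(sites) // 4)
    print(motif, weak)

    # Second run: candidate ESEs from exons of the weak sites, k = 0,
    # so the result is simply the majority (consensus) sequence.
    ese_motif, _, none = split_outliers(["GAAGAA", "GAAGAA", "GAAGAT"], 0)
    print(ese_motif, none)
```

With k = 0 nothing is discarded, so the second call reduces to a plain column-wise majority vote.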

4 Investigation of Splice Sites in the Human Genome 

The human genome has many multi-exon genes with excellent EST coverage and high-quality 
annotation. Thus, it provides a good source of known splice sites on which we can evaluate SeeSite's 
ability to detect canonical and non-canonical motifs as well as their complementary ESEs. Our 
benchmark dataset consists of 10,000 known splice sites from the human genome (hg19 assembly)
and its reference annotation (RefSeq). To capture ESEs, we extracted two 100 bp subsequences
centered at the 5' and 3' splice sites flanking each known intron.

4.1 Detection of Canonical and Non-Canonical Splice Sites

We ran SeeSite with the minimum size of the subgraph (parameter $m$) equal to 100, and varied the
values of $\ell$, $d$, and $b$. The parameter $n$, which is the number of $\ell$-mers that need to be
considered in the third step of the algorithm, is at least $m$, and the maximum number of outlier
sequences (parameter $k$) was set to $\lfloor n/4 \rfloor$. For each set of splice sites corresponding to
one consensus sequence, there exists a (possibly empty) set of outlier sequences. SeeSite detected
splice sites in 9,208 of the 10,000 genes considered; 87% of these sites overlapped with known gene
models that have been verified by ESTs.

One of the main advantages of SeeSite is its accuracy in detecting weak splice sites. A metric,
referred to as the consensus value (CV), is used to gauge the degeneracy (or strength) of a splice
site; it is an index that ranges from 100 (perfect consensus) to 0 (worst consensus) [HI, 27]. Of the
9,208 sites detected by SeeSite, less than 5% had a CV greater than 80, and more than two thirds
had a CV less than 77. The splice sites that were identified as "outliers" by SeeSite had a mean
CV of 68, with the variance of the distribution of the CVs being 2.5.

Splicing Pattern             ESEs with High Correlation   ESEs with Low Correlation
---------------------------  ---------------------------  -------------------------
GT(G/A) AGT (T/C)AG          AAAAGA (24%)                 ATGGCG
                             AAGAAT (21%)                 CAAGAT
                             GAAAAT (17%)                 ATGAGA
                             TGGAAA (16%)                 ATGAGA
GCA (T/A) G (G/T) (T/A) AC   ATGAGA (33%)                 AAAAGA
                             TACAGA (34%)                 GAAAAT
                                                          ATGGAA
                                                          CAAGAT
ATG(C/A)T(G/A) || (G/T)AC    ATGGAA (41%)                 AAAAGA
                             CAAGAT (23%)                 GAAAAT
                                                          TACAGA
                                                          ATGAGA

Table 1: Some examples that illustrate the relationship between co-occurring splice sites and ESEs
that control splicing expression. We searched for all possible ESEs in over 3000 weak splice sites of
the form GT-AG, over 500 weak splice sites of the form GC-AC, and over 100 weak splice sites of
the form AT-AC. We observed a number of ESEs that are either likely or unlikely
to occur in the presence of splice sites that have a specific pattern. For each splice site pattern we
list the ESEs with high correlation and low correlation. Those with low correlation occurred in
less than 10% of the exons associated with the corresponding splicing pattern. The percentage of
occurrence is given in brackets for each of the ESEs with high correlation.

SeeSite detected all splice sites with well-known canonical forms and also identified
non-canonical sites. For example, the consensus splice site pattern that has GT(G/A)AGT as the first
6 bp of the intron and (T/C)AG as the last 3 bp of the intron is a well-known canonical splicing form
in Homo sapiens data. Although these splice sites are highly degenerate and leave GT-AG as the
only reliable splice pattern in the Homo sapiens data [THj, SeeSite, TopHat [22], and HMMSplicer
[8] were capable of detecting the majority of these sites. SeeSite identified the GT-AG, GC-AG, and
AT-AC splice site patterns in 87%, 3%, and 0.5% of the genes considered, respectively. Of the GT-AG
splice sites identified, 2% matched the consensus perfectly (CV of 100), 4% had a CV between 90 and
80, 26% had a CV between 80 and 77, and the remaining 68% had a CV less than 77.

4.2 ESEs are Paired with Weaker Splice Sites 

Whereas most past computational methods search for either splice sites or ESEs, we sought
pairs of patterns that demonstrate an unusually strong tendency to co-occur across exons. To
accomplish this task, we ran SeeSite on each set of outlier sequences with the minimum subgraph size
(parameter $m$) equal to 50, varied values of $d$ and $b$, $\ell$ equal to 5 and 6, and $k = 0$. The
largest set of outlier sequences contained 198 sequences. We validated our findings by comparing the
ESEs found by SeeSite with those identified by RESCUE-ESE [9]. Of the sequences found at this stage by
SeeSite, 95% were identified by Fairbrother et al. [9] as being an ESE. A summary of our results is
given in Table 1, which shows a pairing of ESEs and specific splice site patterns. Our results give
strong evidence for the existence of co-occurring splicing elements, that is, a pairing of splice sites
with specific ESEs. For the weak splice sites of each form (GT-AG, GC-AC, and AT-AC), we determined the
ESEs that occur most frequently and those that occur most infrequently, and compared
these across the different splice site forms. We observed that the occurrence of several ESEs is highly
correlated with the corresponding splice site having a specific pattern (i.e., GT-AG, GC-AC, or
AT-AC), while that of others has a low correlation. Hence, there exists strong evidence for a pairing of
ESEs with splice sites.

In addition, we compared the number of exons that are associated with strong splice sites
and paired with an ESE with the number of exons that are associated with weak splice
sites and paired with an ESE. We found that 90% of weak splice sites are paired
with an ESE, as opposed to 30% of strong splice sites. In this context, we refer to a splice site as
strong if the CV is greater than or equal to 85. This statistic supports the ongoing conjecture that
co-occurring pairs contribute to splicing by compensating for a lack of strong splicing signals.

5 Conclusion and Future Work 

SeeSite is a computational tool for detecting splice sites and ESEs, and for identifying co-occurring
relationships between these sites. Our results suggest the existence of several non-canonical splice
site patterns and demonstrate a possible synergistic relationship between ESEs and different classes
of splice site patterns. Future experimental work is needed to resolve whether these relationships
between splicing elements and non-canonical splice sites are biologically significant or simply
spurious correlations detected in the data. However, we believe the latter is unlikely given the
strength and frequency of a number of the patterns. Determining the exact biological relationship of
these paired splicing elements (for example, they could act negatively to promote exon skipping) and
whether these pairs are equally conserved over evolution warrants further investigation.